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It is widely held that a substantial genetic component underlies Bipolar Disorder (BD) and 
other neuropsychiatric disease traits. Recent efforts have been aimed at understanding the 
genetic basis of disease susceptibility, with genome-wide association studies (GWAS) unveiling 
some promising associations. Nevertheless, the genetic etiology of BD remains elusive with a 
substantial proportion of the heritability which has been estimated to be 80% based on twin 
and family studies unaccounted for by the specific genetic variants identified by large-scale 
GWAS. Furthermore, functional understanding of associated loci generally lags discovery. 
Studies we report here provide considerable support to the claim that substantially more re- 
mains to be gained from GWAS on the genetic mechanisms underlying BD susceptibility, and 
that a large proportion of the variation in disease risk may be uncovered through integrative 
functional genomic approaches. We combine recent analytic advances in heritability esti- 
mation and polygenic modeling and leverage recent technological advances in the generation 
of -omics data to evaluate the nature and scale of the contribution of functional classes of 
genetic variation to a relatively intractable disorder. We identified cis eQTLs in cerebellum 
and parietal cortex that capture more than half of the total heritability attributable to SNPs 
interrogated through GWAS and showed that eQTL-based heritability estimation is highly 
tissue-dependent. Our findings show that a much greater resolution may be attained than 
has been reported thus far on the number of common loci that capture a substantial propor- 
tion of the heritability to disease risk and that the functional nature of contributory loci may 
be clarified en masse. 

Recent advances have facilitated high-throughput explorations of function-related aspects of the genome 
These developments have yielded comprehensive maps of functional elements, providing an unparalleled 
resource to infer the phenotypic consequences of genetic variation. Our group has been particularly inter- 
ested in clarifying the impact of regulatory variation on pathophysiology ^ and in integrating heterogeneous 
genomic datasets to expand on GWAS findings. 

For many complex traits, there is an enormous, much-lamented gap between estimates of heritability 
derived from population studies (e.g., family and twin studies) and the proportion of variation explained 
by specific genetic loci identified by genome- wide scans A second equally fundamental lacuna exists, 

between the set of genetic variants identified by GWAS and the functional role of the implicated variants in 
mediating the trait. 

Recent reports from our group and others have evaluated the utility of expression quantitative trait loci 
(eQTL) and more recently methylation quantitative trait loci (mQTL) ^ and microRNA quantitative 



trait loci (miRNAQTL) to enhance discovery of the genetic basis of complex traits. All together, these 
studies have demonstrated that trait associations from a comprehensive catalog of published GWAS as 
well as from the top ranks of GWAS results, at least for some complex traits, are significantly enriched for 
functional quantitative trait loci of molecular-level phenotypes, raising the possibility of utilizing functional 
genomics to increase the power of a genome-wide scan to detect true associations as well as to clarify the 
nature of the identified associations. However, it remains to be seen to what extent the findings emerging 
from functional genomic studies (e.g., eQTLs) may explain the so-called missing heritability characteristic 
of GWAS findings. 

A flurry of recent studies have utilized random effects models (linear mixed modeling) ^'^•^^ to estimate the 
"chip" heritability due to causal loci tagged by SNPs interrogated in GWAS and a weighted risk score analytic 
approach (polygenic modeling) ^^'^'"^ to quantify the contribution of SNPs that fail to reach genome-wide 
significance in GWAS scans. However, the findings from the application of these methods to GWAS thus 
far have yielded little insight into the nature of the polygenic component to trait variation. Bipolar Disorder 
(BD) has been seen as especially intractable and has contributed little to illuminate the nature of the genetic 
architecture of disease risk. We hypothesized that the convergence of functional genomic approaches and 
these recently developed techniques for analysis of GWAS data may shed light on the number of contributory 
loci for BD and provide a global molecular perspective on their mode(s) of functional mediation. Although 
a recent study concluded that BD and Schizophrenia share a polygenic component with tens of thousands 
of common genetic variants of small effect size likely to be contributory, studies we report here show that a 
much greater resolution may be attained on the number of common loci that capture a substantial proportion 
of the heritability to disease risk and that the functional nature of contributory loci may indeed be clarified 
en masse. 

We had conducted genome- wide expression profiling to map eQTLs (and niQTLs) in the cerebellum 
cortex, as previously described ^. For the present study, we also utilized the results of eQTL mapping 
in parietal cortex. We classified eQTLs into cis and trans regulators of gene expression phenotypes on the 
basis of the SNP's distance to its target gene and on the strength of the evidence for association with gene 
expression (see Methods). Using the Wellcome Trust Case Control Consortium (WTCCC) GWAS study in 
BD , we observed a highly significant enrichment for parietal cortex cis eQTLs and cerebellum cis eQTLs 
(p < 0.001 for each) among the top SNPs (defined as p < 0.001) (Figure 1) relative to random sets of SNPs 
(n = 1000) matched on minor allele frequency and location with respect to nearest gene. In contrast, no 
enrichment was found (p > 0.05) for cis eQTLs identified in lymphoblastoid cell lines (LCLs). 

We therefore quantified the contribution of eQTLs in the separate tissues to the heritability of disease 
risk. Following Yang et al. we utilized a linear mixed model (LMM): 



Y = Xl3 + g + e 



var{Y) = Aal + lal 

where Y is a phenotype vector of size A'^ x 1, /3 is a vector of fixed effects, g is a vector of random ad- 
ditive genetic effects (the "polygenic component" of trait variation) from the set of eQTLs under study 
(more generally, any set of QTLs with a priori support from functional genomics, e.g., mQTL ^^^^^^^ or 
mlRNAQTL lO'i^)^ and e is a vector of residuals. This model leads to two variance components, namely 
the additive genetic variance a^g captured by the tested SNPs and the residual variance cr^e- We estimated 
the (narrow-sense) heritability, a^g / {cr^g + <J^e)j explained by the cis eQTLs and, separately, the trans 
eQTLs (despite differential power) These estimates, derived from the genotyped SNPs, quantify the 
proportion of phenotypic variance attributable to causal variants in linkage disequilibrium (LD) with the 
tested eQTL SNP sets. A genetic relationship matrix (GRM), denoted by A (with entries Ay between pairs 
of individuals i and j), was estimated from each SNP set, and variances were estimated using restricted 
maximum likelihood (REML) We were interested in the heritability captured by the cis eQTLs (and 

the trans eQTLs) as a proportion of the heritability captured by all SNPs interrogated in the GWAS and 
passing QC filters (see Methods), h^eqti / h^aii, hereafter referred to as the heritability concentration index. 
We demonstrated, on theoretical grounds (see Methods) and empirically, that the heritability concentration 
index is the same whether calculated on the observed scale or estimated on the liability scale (the latter 
premised on a continuous liability threshold model). Furthermore, it is independent of disease prevalence 
(K) or ascertainment bias (P, the proportion of cases in the samples, which may be >K) (see Methods). 

To evaluate the contribution of eQTLs to the (narrow-sense) heritability of BD, we utilized the TGen+GAIN 
dataset, which consists of data from the Bipolar Genome Study (TGen) and from an earlier study (GAIN) 
conducted under the GAIN initiative (Table 1). All together, the sample set included in our analysis consists 
of 2,191 cases and 1,434 controls. Selecting samples so that \Aij\ < 0.025 for all pairs i and j (n = 3,189 
individuals) and conducting extensive QC on the genotype data (see Methods), we performed the heritabil- 
ity estimation analysis in the TGen+GAIN dataset with and without the study ID as covariate, with no 
substantial difference in the estimates. Indeed, consistent with a previous report on the WTCCC study 
of BD the genome-wide SNPs (n > 600,000) explained 35% (s.e. = 6%) of the phenotypic variance on 
the liability scale (assuming disease prevalence K = 0.01). Remarkably, we found that the cerebellum cis 
eQTLs (n = 27,107) and parietal cortex cis eQTLs (n = 26,979) included in our analysis had heritability 
concentration index h^eqti / h^aii of 0.57 and 0.58, respectively. The heritability captured by the cis eQTLs 



previously identified in LCLs ^ was less than 1%, which demonstrates the tissue dependence of eQTL-based 
heritability estimation. The inclusion of 20 principal components in the LMM as fixed effects yielded similar 
estimates (Supplemental Table 1). 

As the results for parietal cortex and cerebellum were found to be similar (and to significantly differ from 
the results for LCLs), we illustrate our approach in the remainder of the paper with the cerebellum eQTLs 
unless explicitly stated. When we evaluated each GWAS (TGen and GAIN) separately and partitioned 
the genetic variance captured by the cis eQTLs onto the individual autosomes (by fitting the GRMs of 
the chromosomes simultaneously in a joint analysis), we found that the estimates of variance explained by 
each chromosome (Figure 2) were highly correlated between the two studies (Pearson correlation = 0.60). 
Furthermore, the signals in the chromosomal estimates in each GWAS (Figure 2) are not driven by the 
large chromosomes. Notably, each GWAS study showed a peak in heritability estimate on chromosome 11. 
Supplemental Table 2 provides a list of the chromosome 11 transcripts implicated by this analysis, including 
some with a number of independent (r^ < 0.10) cis eQTLs. In particular, using gene set enrichment analysis 

we found that the chromosome 11 transcripts with 5 or more cis eQTLs contributing to our estimate 
of heritability are significantly enriched for Immunoglobulin C-2 Type (IGc2) proteins (Benjamini Hochberg 
adjusted p = 6.2 x 10"^; Supplemental Table 3), which raises the question of an infectious/infiammatory 
etiology to BD ■^'^ or a novel CNS mechanism for this immunoglobulin. 

Although our study was not sufficiently powered to reliably identify trans eQTLs, we found that the set 
of SNPs (n = 6,892) included in our analysis with association p < 1.9 x 10"^ (= 0.05/# of genes tested) with 
at least one gene expression trait in cerebellum had heritability concentration index of 0.11. 

Since experimental or genotyping artifacts as case-control differences may appear as "heritability" (more 
so than in the case of quantitative traits, which are less likely to be correlated with these artifacts) we 
investigated the effect of QC thresholds on the estimates of heritability captured by the genome-wide SNPs 
as well as the cis eQTLs and the trans eQTLs among the interrogated SNPs (see Methods) and found these 
estimates to be stable. 

To test for the presence of any inflation in these estimates, we conducted heritability estimation using the 
genome-wide SNPs and, separately, the subset of cis eQTLs on permuted traits (n=1000) while conditioning 
on the same corresponding genetic relationship matrix A (one for each SNP set). This analysis showed 
that estimates of heritability of simulated phenotypes derived from the genome-wide SNPs and from the cis 
eQTLs, given their standard error, were consistent with zero heritability, as we would expect from a simulated 
trait with no real association with genotype. Supplemental Figure 1 illustrates the (null) distribution of the 
heritability estimate for the cis eQTLs. 

SNP-based heritability estimation is susceptible to confounding by population stratification The use of 



principal components (as fixed effects) in the LMM model may not adequately correct for such confounding, 
and indeed principal component-adjusted estimates may yield substantially similar values as the non-adjusted 
ones in the presence of fine-scale population structure To quantify the effect of population structure, 
we estimated heritability from the sum of heritabilities obtained from two disjoint sets of chromosomes (chr 
1-10 and chr 11-22), which we then compared with the estimate derived from genome-wide SNPs, following 
Yang et al. '^^^ . We found that the heritability concentration index was largely unaffected (0.57 [joint] vs. 
0.55 [disjoint]). 

Contributions to the heritability concentration index from causal variants tagged by the cis eQTLs 
may be distorted by patterns of LD. Indeed, regions of strong LD may amplify a SNP-based estimate of 
heritability while contributions from poorly tagged variants may be underestimated. We therefore utilized 
an LD-adjusted kinship matrix recently developed by Speed et al. to tease apart the impact of local LD 
from that of the architecture of causal variants on our estimate of the heritability concentration index. We 
calculated h^eqti I all for cis eQTLs using a modified genetic relatedness matrix derived from scaled SNP 
genotypes. The heritability concentration index did not change substantially with the use of LD-adjusted 
kinship matrix (LD-adjusted h^eqti I all = 0.572 vs. non-adjusted h^eqti I all = 0.570). 

The concentration of heritability observed for the cis eQTLs was noteworthy. We took a second analytic 
approach to evaluate to what extent cis eQTLs may have a collective effect on disease susceptibility. The 
polygenic modeling (PM) approach has been utilized in several recent studies to demonstrate a polygenic 
component to an array of complex human traits, including schizophrenia rheumatoid arthritis, myocardial 
infarction and coronary artery disease We selected an LD-pruned set of cis eQTLs that meet a P- value 
threshold in a "discovery" GWAS (TGen) and calculated a polygenic risk score (see Methods) from this 
set for each individual in a "replication" GWAS (GAIN) using the risk alleles and the effects sizes from 
the "discovery" TGen study. We followed the LD parameters used by Purcell et al. to enable direct 
comparison (see Methods). We utilized several P- value thresholds for association with BD in TGen (namely, 
P-value TGEN < 0.0001, 0.01, 0.05, and 0.10) to define a set of cis eQTLs and evaluated the polygenic 
risk score for association with case-control status in the second independent GWAS (GAIN) using logistic 
regression. 

Table 2 illustrates the dependence of the association of the polygenic risk score with disease status on 
the P-value tgen threshold. In particular, the set of cis eQTLs defined by P-value tgen < 0.05 (n = 2,375 
SNPs) showed the most significant association with case-control status. Figure 3 shows the chromosomal 
location of this particular set of cis eQTLs, which appear to be scattered uniformly throughout the genome. 

Several observations are worth noting here. First, this set of eQTLs constitutes a much smaller number 
of SNPs (than has previously been reported) that underlie a polygenic variation in the trait, suggesting that 



the use of eQTLs can facilitate unprecedented resolution of the polygenic component to disease. Second, 
the association of the polygenic risk score with case-control status is more significant for the set of cis eQTLs 
with P-value tgen < 0.05 than it is for the larger set of interrogated SNPs that satisfy P-value tgen < 
0.05. Third, our approach not only tests whether a polygenic component predicts disease risk (as other PM 
studies do) it also highlights a potential functional mechanism for the polygenic component. Finally, both 
statistical approaches, LMM and PM, yield consistent findings on the effect of cis eQTLs, in aggregate, on 
BD susceptibility. 

In summary, we have undertaken to quantify the heritability captured by a functional class of quantitative 
trait loci for an important complex disorder. Our study not only provides support for the role of common 
genetic variation in disease susceptibility, but importantly also yields a functional basis for the polygenic 
variation in the trait. This understanding of the genetic architecture of disease risk could have direct clinical 
utility and inform the design of future studies. 



METHODS 

eQTL Mapping 

The results of our eQTL studies in cerebellum were previously reported ^. Here we also present results in 
parietal cortex. Briefly, DNA and RNA of cerebellum samples from 153 individuals of European descent were 
obtained from the Stanley Medical Research Institute. Genotyping was done on the Affymetrix GeneChip 
Mapping 5.0 Array (Affymetrix, Santa Clara, CA, USA). Genotype data can be found in the Stanley 
Genomics Database. Genome- wide expression profiling was performed using the Affymetrix Human Gene 
1.0 ST Array (GEO GSE35974 and GSE35977, for cerebellum and parietal cortex, respectively). SNPs 
with call rates less than 99% were filtered, as were SNPs that departed from Hardy- Weinberg equilibrium 
(P<0.001) and SNPs with MAF<10%. Principal components analysis, as implemented in Eigenstrat was 
used to test for the existence of population structure. We performed imputation using MACH vl.O We 
excluded from analysis all probes that could be mapped to multiple genome regions as well as all probes that 
contain common SNPs (MAF>0.01) on the basis of 1000 Genomes and HapMap data. ComBat was used 
to adjust for batch effect. Surrogate variable analysis was conducted and identified surrogate variables were 
regressed out. Quantile normalization was used on the residuals. Linear regression with dosage of the minor 
allele for each SNP was then performed to identify eQTLs. In this study, a cis region is defined as within 
4 MB of the probe site, while a trans region refers to the rest of the genome. A threshold of <0.01 was 
used for the cis analysis, while the trans analysis threshold was 0.05 divided by the number of probes. As 
our primary interest was in quantifying the heritability captured by these sets of SNPs (and certain subsets 
thereof), we started from these thresholds. 

Genome-wide Association Studies in BD 

Two genome-wide studies of Bipolar Disorder were utilized for our study of heritability estimation. The 
TGen GWAS consists of 1,190 cases from the Bipolar Genome Study and 401 controls. In the origi- 
nal GWAS study, QC procedure excluded SNPs with low MAF (<1%), significant departure from Hardy- 
Weinberg equilibrium in controls (p < lO"*"), low call rate (<95%), and other criteria. A second GWAS, 
from the GAIN initiative consists of 1,001 cases and 1,033 controls of European descent. As in the 
TGEN study, SNPs were not included in the analysis if the MAF was less than 1%, the SNP violated Hardy- 
Weinberg equilibrium (p < 10"®) in control samples, if the call rate was low (< 95%), if there were 3 or more 
Mendelian errors, or if there was more than one discrepancy among duplicate samples. 

Following Lee et al. we performed additional QC steps to ensure the robustness of our estimates 
of heritability such as excluding SNPs with p < 0.05 for Hardy- Weinberg equilibrium and for missingness- 
difference between cases and controls. Only autosomal SNPs were included in our heritability estimation 
analysis. 



Heritability Estimation on the Observed Scale and on the Liability Scale 

The liability threshold model presupposes an underlying continuous random variable that defines case- 
control status. Cases are those subjects for which the liability exceeds a given threshold t. For our purposes, 
suppose that the population prevalence is K . Suppose P is the proportion of cases in the sample set; in 
general, this proportion, an ascertainment parameter, may not be a random sample from the population. 
Then the relationship between the heritability on the observed scale and the heritability on the liability 
scale is given by the following expression 

' " ° P(l-P) $(tf 

where $(4) is the y-value of the standard normal curve at the point t. Note the same scaling factor 
between the observed scale and the liability scale applies to the estimate of heritability for the cis eQTLs or 
trans eQTLs as for the genome- wide SNPs: 

P(l-P) 

Thus, the ratio of the estimate of heritability attributable to cis eQTLs (or to trans eQTLs) relative to the 
estimate of heritability attributable to all interrogated SNPs, which we call eQTL "heritability concentration 
index", is the same whether on the observed scale or on the liability scale, and is independent of disease 
prevalence and of ascertainment: 

CIS ] r CIS 1 

7 2 \liability — [,2 \ohserved 
'^all '^all 

In our heritability estimation analysis, we selected samples so that \Aij\ < 0.025 for all pairs i and j (leaving 
n = 3,189 individuals). 

Simulation Analysis Under the Null 

We conducted simulations (n = 1000) to test for the presence of inflation in our estimates of the heritability 
captured by the genome-wide SNPs and, separately, by the cis eQTLs. In this analysis, we preserved the 
genotype correlation structure, utilized the genetic relationship matrix defined by each set of SNPs, and 
calculated the heritability for a permuted trait (n = 1000). An empirical p- value was generated for the 
estimate of heritability, defined as the number of times the estimate for a simulated trait matches or exceeds 
the observed estimate. Additionally, we determined the number of times an estimate for a simulated trait 
is consistent with zero heritability, e.g., in the case of cis eQTLs, the set of points ( h^^^ , SE {h^cis)) 
[0,l]x[0,l] that satisfy 



hi, - 2SE{hl,) < 

where SE [hl,^ is the standard error for the estimate. 
Polygenic Modeling with Functional Variation 

We utiHzed polygenic niodehng to evaluate the effect of large numbers of weakly associated SNPs 
characterized by very small allele frequency differences between cases and controls. To facilitate direct 
comparison with the Purcell et al. study, we pruned a given SNP set (e.g., the cis eQTLs) to filter SNPs in 
strong LD with other SNPs (using a pairwise of 0.25, within a 200-SNP sliding window). Using the set 
of risk alleles from the resulting LD-pruned set and the corresponding effect size from a "discovery" GWAS, 
we calculated, in a "validation" GWAS, a polygenic score from the log odds ratio-weighted sum of risk allele 
count Xij for each individual j : 

i 

For polygenic modeling, we evaluated the sets of cis eQTLs defined by P-value tgen < 0.0001, 0.01, 
0.05, and 0.10. We tested the calculated polygenic score for association with disease status. 



Figure 1. Enrichment of eQTLs among the top SNPs from the WTCCC study. The highest 
ranked SNPs (p<0.001) in the WTCCC study of BD were found to be enriched for cerebellum cis eQTLs 
and parietal cortex cis eQTLs relative to random SNPs matched on minor allele frequency and location 
with respect to nearest gene. The black dot represents the observed count and the histogram depicts the 
empirical (null) distribution of the eQTL count generated from the randomly drawn SNPs. (In contrast, 
no evidence for eQTL enrichment was observed in LCLs.) 
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Figure 2. Partitioning of variance captured by cis eQTLs by chromosome. The estimates of 
variance, from the two GWAS GAIN and TGen, captured by the cerebellum cis eQTLs on each chromosome 
were highly correlated (Pearson correlation = 0.60). Red corresponds to the GAIN study while orange the 
TGen study. The estimates shown here are on the observed scale. 




Figure 3. eQTL-based polygenic modeling. We selected an LD-pruned set of cis eQTLs that 
meet a P-value threshold in a "discovery" GWAS (TGen) - P-value tgen < 0.0001, 0.01, 0.05, and 0.10 
were tested - and calculated a polygenic risk score from this set for each individual in a "replication" GWAS 
(GAIN) using the risk alleles and the effects sizes from the "discovery" TGen study. The set of cis eQTLs 
defined by P-value tgen < 0.05 (n = 2,375 SNPs) showed the most significant association with case-control 
status in GAIN using logistic regression. Each red mark indicates a cis eQTL SNP included in the polygenic 
model. 




Table 1. Genome-wide association studies of Bipolar Disorder evaluated in our study. 



Study 


Cases 


Controls 


TGEN 


1190 


401 


GAIN 


1001 


1033 



Table 2. Polygenic modeling with eQTLs. Using the cis eQTLs, we determined, for each individual 
in the "repUcation" GAIN study, a polygenic risk score, which is defined as the sum of the risk allele counts 
weighted by the log odds ratio using the risk alleles and the effect sizes from the "discovery" TGen study The 
polygenic risk score was then tested for association with disease status in the GAIN study. An LD-pruned 
SNP set consisting of cis eQTLs with p < 0.05 in the TGEN study showed the most significant association 
with case control status in the GAIN study. 



TGen p-value threshold 


p-value of association of polygenic score with disease status 


p < 0.0001 


0.894 


p < 0.01 


0.0245 


p < 0.05 


0.01 


p < 0.1 


0.0115 
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